Maximum likelihood (MLE) and method of moments (MM) are two common methods for constructing a model.
I will explain to the reader how one might use MLE and MM to model (a) Glycohemoglobin and (b) Height of adult females. The data will be from National Health and Nutrition Examination Survey 2009-2010 (NHANES), available from the Hmisc package. I will compare and contrast the two methods in addition to comparing and contrasting the choice of underlying distribution.
#Import data
require(dplyr)
Hmisc::getHdata(nhgh)
d1 <- nhgh %>%
filter(sex == "female") %>%
filter(age >= 18) %>%
select(gh, ht) %>%
filter(1:n()<=1000)
#HT,Normal, MLE
ll <- function(params, data) sum(dnorm(data,params[1],params[2],log=TRUE))
#ll(c(2,3),d1$ht)
g <- function(x) {
out <- x
for(i in seq_along(x)){
out[i] <- ll(c(x[i],7),d1$ht)
}
out
}
h <- function(x) {
out <- x
for(i in seq_along(x)){
out[i] <- ll(c(160,x[i]),d1$ht)
}
out
}
In this chunk, we build function f and function g. These functions are used to calculte the other parameter when fix one. Using the principle of MLE. They help us adjust the parameters.
require(stats4)
nLL <- function(mean, sd){
fs <- dnorm(
x = d1$ht
, mean = mean
, sd = sd
, log = TRUE
)
-sum(fs)
}
param_hat <- mle(
nLL
, start = list(mean = 160, sd = 5)
, method = "L-BFGS-B"
, lower = c(0, 0.01)
)
par(mfrow = c(1,2)); plot(profile(param_hat), absVal = FALSE)
In this chunk, we use function to calculate the MLE when set two parameters that close to real MLE. From the figure, we can see when z=0, the two parameters are what we need.
coef(param_hat)
## mean sd
## 160.741906 7.316505
This is the specific value of the two parameters.
hist(d1$ht,breaks = 100, freq = FALSE)
curve(dnorm(x,mean=coef(param_hat)[1],sd=coef(param_hat)[2]),col="red",lwd=6,add=T)
The figure above is that overlay estimated pdf onto histogram of the empirical distribution.
plot(ecdf(d1$ht))
curve(pnorm(x,mean=coef(param_hat)[1],sd=coef(param_hat)[2]),col="red",lwd=6,add=T)
The figure above is that overlay estimated CDF onto eCDF of the empirical distribution.
qs <- seq(0.05,0.95,length=50)
sample_qs <- quantile(d1$ht,qs)
theor_qs <- qnorm(qs,mean=coef(param_hat)[1],sd=coef(param_hat)[2])
plot(sample_qs,theor_qs,pch=16)
abline(0,1)
Sample max fits a straight line with y=x. We can say that the sample quantile fits the theoretical quantile.
qnorm(0.5,mean=coef(param_hat)[1],sd=coef(param_hat)[2])
## [1] 160.7419
The estimated Median is 160.7419064. We use qnorm to select 50% of the observations, 50% of the observations are less than it, and 50% of the observations are greater than it. So this is the median of the sample.
M <- 5000
N <- 1000
out <- rnorm(M*N,mean=coef(param_hat)[1],sd=coef(param_hat)[2]) %>% array( dim=c(M,N))
sample_dist <- apply(out,1,median)
hist(sample_dist, breaks = 100)
The figure above is that median Samp Dist (hist). We simulated 5000 experiments, each with 1000 observations, so we take the median of these 5000 experiments. Draw as histogram.
#hist(sample_dist, breaks = 100)
quantile(sample_dist,c(0.05/2, 1 - 0.05/2))
## 2.5% 97.5%
## 160.1659 161.2980
#abline(v = quantile(sample_dist,c(0.05/2, 1 - 0.05/2)), col = "blue", lty = 2)
The Range of middle 95% of Samp Dist is from 160.1658828 to 161.2979761. This is the 95% confidence interval. We have 95% confidence that median will be within this range. This has an indicator effect on statistical inference.
#HT,Gamma, MLE
ll <- function(params, data) sum(dgamma(data,params[1],params[2],log=TRUE))
#ll(c(2,3),d1$ht)
g <- function(x) {
out <- x
for(i in seq_along(x)){
out[i] <- ll(c(x[i],0.3),d1$ht)
}
out
}
h <- function(x) {
out <- x
for(i in seq_along(x)){
out[i] <- ll(c(480,x[i]),d1$ht)
}
out
}
In this chunk, we build function f and function g. These functions are used to calculte the other parameter when fix one. Using the principle of MLE. They help us adjust the parameters.
require(stats4)
nLL_gamma <- function(shape, scale){
fs <- dgamma(
x = d1$ht
, shape = shape
, scale = scale
, log = TRUE
)
-sum(fs)
}
param_hat_gamma <- mle(
nLL_gamma
, start = list(shape = 480, scale = 0.3)
, method = "L-BFGS-B"
, lower = c(0, 0.01)
)
par(mfrow = c(1,2)); plot(profile(param_hat_gamma), absVal = FALSE)
In this chunk, we use function to calculate the MLE when set two parameters that close to real MLE. From the figure, we can see when z=0, the two parameters are what we need.
coef(param_hat_gamma)
## shape scale
## 480.0000205 0.3348802
This is the specific value of the two parameters.
hist(d1$ht,breaks = 100, freq = FALSE)
curve(dgamma(x,shape=coef(param_hat_gamma)[1],scale=coef(param_hat_gamma)[2]),col="red",lwd=6,add=T)
The figure above is that overlay estimated pdf onto histogram of the empirical distribution.
plot(ecdf(d1$ht))
curve(pgamma(x,shape=coef(param_hat_gamma)[1],scale=coef(param_hat_gamma)[2]),col="red",lwd=6,add=T)
The figure above is that overlay estimated CDF onto eCDF of the empirical distribution.
qs <- seq(0.05,0.95,length=50)
sample_qs <- quantile(d1$ht,qs)
theor_qs <- qgamma(qs,shape=coef(param_hat_gamma)[1],scale=coef(param_hat_gamma)[2])
plot(sample_qs,theor_qs,pch=16)
abline(0,1)
Sample max fits a straight line with y=x. We can say that the sample quantile fits the theoretical quantile.
qgamma(0.5,shape=coef(param_hat_gamma)[1],scale=coef(param_hat_gamma)[2])
## [1] 160.6309
The estimated Median is 160.6308885. We use qgamma to select 50% of the observations, 50% of the observations are less than it, and 50% of the observations are greater than it. So this is the median of the sample.
M <- 5000
N <- 1000
out <- rgamma(M*N,shape=coef(param_hat_gamma)[1],scale=coef(param_hat_gamma)[2]) %>% array( dim=c(M,N))
sample_dist <- apply(out,1,median)
hist(sample_dist, breaks = 100)
The figure above is that median Samp Dist (hist). We simulated 5000 experiments, each with 1000 observations, so we take the median of these 5000 experiments. Draw as histogram.
quantile(sample_dist,c(0.05/2, 1 - 0.05/2))
## 2.5% 97.5%
## 160.0639 161.1975
The Range of middle 95% of Samp Dist is from 160.0638804 to 161.1975156. This is the 95% confidence interval. We have 95% confidence that median will be within this range. This has an indicator effect on statistical inference.
#HT,Gamma, MLE
ll <- function(params, data) sum(dweibull(data,params[1],params[2],log=TRUE))
#ll(c(2,3),d1$ht)
g <- function(x) {
out <- x
for(i in seq_along(x)){
out[i] <- ll(c(x[i],165),d1$ht)
}
out
}
h <- function(x) {
out <- x
for(i in seq_along(x)){
out[i] <- ll(c(22,x[i]),d1$ht)
}
out
}
In this chunk, we build function f and function g. These functions are used to calculte the other parameter when fix one. Using the principle of MLE. They help us adjust the parameters.
require(stats4)
nLL_weibull <- function(shape, scale){
fs <- dweibull(
x = d1$ht
, shape = shape
, scale = scale
, log = TRUE
)
-sum(fs)
}
param_hat_weibull <- mle(
nLL_weibull
, start = list(shape = 22, scale = 165)
, method = "L-BFGS-B"
, lower = c(0, 0.01)
)
par(mfrow = c(1,2)); plot(profile(param_hat_weibull), absVal = FALSE)
In this chunk, we use function to calculate the MLE when set two parameters that close to real MLE. From the figure, we can see when z=0, the two parameters are what we need.
coef(param_hat_weibull)
## shape scale
## 21.85396 164.24718
This is the specific value of the two parameters.
hist(d1$ht,breaks = 100, freq = FALSE)
curve(dweibull(x,shape=coef(param_hat_weibull)[1],scale=coef(param_hat_weibull)[2]),col="red",lwd=6,add=T)
The figure above is that overlay estimated pdf onto histogram of the empirical distribution.
plot(ecdf(d1$ht))
curve(pweibull(x,shape=coef(param_hat_weibull)[1],scale=coef(param_hat_weibull)[2]),col="red",lwd=6,add=T)
The figure above is that overlay estimated CDF onto eCDF of the empirical distribution.
qs <- seq(0.05,0.95,length=50)
sample_qs <- quantile(d1$ht,qs)
theor_qs <- qweibull(qs,shape=coef(param_hat_weibull)[1],scale=coef(param_hat_weibull)[2])
plot(sample_qs,theor_qs,pch=16)
abline(0,1)
Sample max doesn’t fit a straight line with y=x perfectly. We maynot say that the sample quantile fits the theoretical quantile very well in the beginning of the figure.
qweibull(0.5,shape=coef(param_hat_weibull)[1],scale=coef(param_hat_weibull)[2])
## [1] 161.5156
The estimated Median is 161.5155603. We use qweibull to select 50% of the observations, 50% of the observations are less than it, and 50% of the observations are greater than it. So this is the median of the sample.
M <- 5000
N <- 1000
out <- rweibull(M*N,shape=coef(param_hat_weibull)[1],scale=coef(param_hat_weibull)[2]) %>% array( dim=c(M,N))
sample_dist <- apply(out,1,median)
hist(sample_dist, breaks = 100)
The figure above is that median Samp Dist (hist). We simulated 5000 experiments, each with 1000 observations, so we take the median of these 5000 experiments. Draw as histogram.
quantile(sample_dist,c(0.05/2, 1 - 0.05/2))
## 2.5% 97.5%
## 160.8373 162.1827
The Range of middle 95% of Samp Dist is from 160.8372666 to 162.1826811. This is the 95% confidence interval. We have 95% confidence that median will be within this range. This has an indicator effect on statistical inference.
In this part, we use MLE to model Height of adult females and we use three different distributions. The content above answers questions in Background part.
According to the feature of normal distribution.E[X]=mean(sample) and V[X]= Var(sample). Thus, we can conclude two parameters below.
mm.mean=mean(d1$ht)
mm.sd=sd(d1$ht)
This is the specific value of the two parameters.
hist(d1$ht,breaks = 100,freq = F)
curve(dnorm(x,mm.mean,mm.sd),col="red",lwd=6,add=T)
The figure above is that overlay estimated pdf onto histogram of the empirical distribution.
plot(ecdf(d1$ht))
curve(pnorm(x,mm.mean,mm.sd),col="red",lwd=6,add=T)
The figure above is that overlay estimated CDF onto eCDF of the empirical distribution.
qs <- seq(0.05,0.95,length=50)
sample_qs <- quantile(d1$ht,qs)
theor_qs <- qnorm(qs,mm.mean,mm.sd)
plot(sample_qs,theor_qs,pch=16)
abline(0,1)
Sample max fits a straight line with y=x. We can say that the sample quantile fits the theoretical quantile.
qnorm(0.5,mean=mm.mean,sd=mm.sd)
## [1] 160.7419
The estimated Median is 160.7419. We use qnorm to select 50% of the observations, 50% of the observations are less than it, and 50% of the observations are greater than it. So this is the median of the sample.
M <- 5000
N <- 1000
out <- rnorm(M*N,mean=mm.mean,sd=mm.sd) %>% array( dim=c(M,N))
sample_dist <- apply(out,1,median)
hist(sample_dist, breaks = 100)
The figure above is that median Samp Dist (hist). We simulated 5000 experiments, each with 1000 observations, so we take the median of these 5000 experiments. Draw as histogram.
quantile(sample_dist,c(0.05/2, 1 - 0.05/2))
## 2.5% 97.5%
## 160.1787 161.3185
The Range of middle 95% of Samp Dist is from 160.178736 to 161.3185258. This is the 95% confidence interval. We have 95% confidence that median will be within this range. This has an indicator effect on statistical inference.
According to the feature of Gamma distribution.E[X]=shape∗scale and V[X]=shape∗scale^2. Thus, we can conclude two parameters below. scale=mean(sample)/var(sample) and shape= mean(sample)^2 / mean(sample).
mm_shape_gamma=mean(d1$ht)^2/var(d1$ht)
mm_scale_gamma=var(d1$ht)/mean(d1$ht)
This is the specific value of the two parameters.
hist(d1$ht,breaks = 100,freq = F)
curve(dgamma(x,shape=mm_shape_gamma,scale=mm_scale_gamma),col="red",lwd=6,add=T)
The figure above is that overlay estimated pdf onto histogram of the empirical distribution.
plot(ecdf(d1$ht))
curve(pgamma(x,shape=mm_shape_gamma,scale=mm_scale_gamma),col="red",lwd=6,add=T)
The figure above is that overlay estimated CDF onto eCDF of the empirical distribution.
qs <- seq(0.05,0.95,length=50)
sample_qs <- quantile(d1$ht,qs)
theor_qs <- qgamma(qs,shape=mm_shape_gamma,scale=mm_scale_gamma)
plot(sample_qs,theor_qs,pch=16)
abline(0,1)
Sample max fits a straight line with y=x. We can say that the sample quantile fits the theoretical quantile.
qgamma(0.5,shape=mm_shape_gamma,scale=mm_scale_gamma)
## [1] 160.6308
The estimated Median is 160.630794. We use qgamma to select 50% of the observations, 50% of the observations are less than it, and 50% of the observations are greater than it. So this is the median of the sample.
M <- 5000
N <- 1000
out <- rgamma(M*N,shape=mm_shape_gamma,scale=mm_scale_gamma) %>% array( dim=c(M,N))
sample_dist <- apply(out,1,median)
hist(sample_dist, breaks = 100)
The figure above is that median Samp Dist (hist). We simulated 5000 experiments, each with 1000 observations, so we take the median of these 5000 experiments. Draw as histogram.
quantile(sample_dist,c(0.05/2, 1 - 0.05/2))
## 2.5% 97.5%
## 160.0672 161.1875
The Range of middle 95% of Samp Dist is from 160.0671567 to 161.1875228. This is the 95% confidence interval. We have 95% confidence that median will be within this range. This has an indicator effect on statistical inference.
According to the feature of Weibull distribution.E[X]= E(X) = b Γ(1 + 1/a) and V[X]=b^2 * (Γ(1 + 2/a) - (Γ(1 + 1/a))^2). These are not very good representations, we constructed the following equation.
#Define mean of Weibull Distribution
mean.weib = function(lambda,k){
lambda*gamma(1+1/k)
}
#Define variance of Weibull Distribution
var.weib = function(lambda,k){
lambda^2*(gamma(1+2/k)-(gamma(1+1/k))^2)
}
#Define lambda
lambda = function(samp_mean,k){
samp_mean/gamma(1+1/k)
}
#Define mean of Weibull Distribution using sample mean
var.weib = function(samp_mean,k){
lambda(samp_mean,k)^2*(gamma(1+2/k)-(gamma(1+1/k))^2)
}
#Define variance of Weibull Distribution using sample mean
var.weib = function(samp_mean,k,samp_var){
lambda(samp_mean,k)^2*(gamma(1+2/k)-(gamma(1+1/k))^2)-samp_var
}
#Build function to calculate two parameters in Weibull Distribution
mm.opt = optimize(f = function(x){
abs(var.weib(k=x,samp_mean = mean(d1$ht),samp_var = var(d1$ht)))
},lower=10,upper = 100)
mm.weib.k = mm.opt$min
mm.weib.lambda = lambda(samp_mean = mean(d1$ht),k=mm.weib.k)
mm.weib.k
## [1] 27.45942
mm.weib.lambda
## [1] 163.9807
This is the specific value of the two parameters.
hist(d1$ht,breaks = 100,freq = F)
curve(dweibull(x,shape=mm.weib.k,scale=mm.weib.lambda),col="red",lwd=6,add=T)
The figure above is that overlay estimated pdf onto histogram of the empirical distribution.
plot(ecdf(d1$ht))
curve(pweibull(x,shape=mm.weib.k,scale=mm.weib.lambda),col="red",lwd=6,add=T)
The figure above is that overlay estimated CDF onto eCDF of the empirical distribution.
qs <- seq(0.05,0.95,length=50)
sample_qs <- quantile(d1$ht,qs)
theor_qs <- qweibull(qs,shape=mm.weib.k,scale=mm.weib.lambda)
plot(sample_qs,theor_qs,pch=16)
abline(0,1)
Sample max doesn’t fit a straight line with y=x perfectly. We maynot say that the sample quantile fits the theoretical quantile very well in the figure.
qweibull(0.5,shape=mm.weib.k,scale=mm.weib.lambda)
## [1] 161.8065
The estimated Median is 161.8065244. We use qweibull to select 50% of the observations, 50% of the observations are less than it, and 50% of the observations are greater than it. So this is the median of the sample.
M <- 5000
N <- 1000
out <- rweibull(M*N,shape=mm.weib.k,scale=mm.weib.lambda) %>% array( dim=c(M,N))
sample_dist <- apply(out,1,median)
hist(sample_dist, breaks = 100)
The figure above is that median Samp Dist (hist). We simulated 5000 experiments, each with 1000 observations, so we take the median of these 5000 experiments. Draw as histogram.
#hist(sample_dist, breaks = 100)
quantile(sample_dist,c(0.05/2, 1 - 0.05/2))
## 2.5% 97.5%
## 161.2808 162.3238
#abline(v = quantile(sample_dist,c(0.05/2, 1 - 0.05/2)), col = "blue", lty = 2)
The Range of middle 95% of Samp Dist is from 161.2808117 to 162.3238118. This is the 95% confidence interval. We have 95% confidence that median will be within this range. This has an indicator effect on statistical inference.
In this part, we use MM to model Height of adult females and we use three different distributions. The content above answers questions in Background part.
#HT,Normal, MLE
ll <- function(params, data) sum(dnorm(data,params[1],params[2],log=TRUE))
#ll(c(2,3),d1$gh)
g <- function(x) {
out <- x
for(i in seq_along(x)){
out[i] <- ll(c(x[i],1),d1$gh)
}
out
}
h <- function(x) {
out <- x
for(i in seq_along(x)){
out[i] <- ll(c(5.7,x[i]),d1$gh)
}
out
}
In this chunk, we build function f and function g. These functions are used to calculte the other parameter when fix one. Using the principle of MLE. They help us adjust the parameters.
require(stats4)
nLL <- function(mean, sd){
fs <- dnorm(
x = d1$gh
, mean = mean
, sd = sd
, log = TRUE
)
-sum(fs)
}
param_hat <- mle(
nLL
, start = list(mean = 5.7, sd = 1)
, method = "L-BFGS-B"
, lower = c(0, 0.01)
)
par(mfrow = c(1,2)); plot(profile(param_hat), absVal = FALSE)
In this chunk, we use function to calculate the MLE when set two parameters that close to real MLE. From the figure, we can see when z=0, the two parameters are what we need.
coef(param_hat)
## mean sd
## 5.724600 1.051721
This is the specific value of the two parameters.
hist(d1$gh,breaks = 100, freq = FALSE)
curve(dnorm(x,mean=coef(param_hat)[1],sd=coef(param_hat)[2]),col="red",lwd=6,add=T)
The figure above is that overlay estimated pdf onto histogram of the empirical distribution. It can be seen that their distribution is very concentrated in the 4-8 interval.
plot(ecdf(d1$gh))
curve(pnorm(x,mean=coef(param_hat)[1],sd=coef(param_hat)[2]),col="red",lwd=6,add=T)
The figure above is that overlay estimated CDF onto eCDF of the empirical distribution. The figure does not fit well, at least not as good as the height. It may be because the distribution is too concentrated. Small disturbances have big consequences.
qs <- seq(0.05,0.95,length=50)
sample_qs <- quantile(d1$gh,qs)
theor_qs <- qnorm(qs,mean=coef(param_hat)[1],sd=coef(param_hat)[2])
plot(sample_qs,theor_qs,pch=16)
abline(0,1)
Sample max doesn’t fit a straight line with y=x perfectly. We cannot say that the sample quantile fits the theoretical quantile very well.
qnorm(0.5,mean=coef(param_hat)[1],sd=coef(param_hat)[2])
## [1] 5.7246
The estimated Median is 5.7246. We use qnorm to select 50% of the observations, 50% of the observations are less than it, and 50% of the observations are greater than it. So this is the median of the sample.
M <- 5000
N <- 1000
out <- rnorm(M*N,mean=coef(param_hat)[1],sd=coef(param_hat)[2]) %>% array( dim=c(M,N))
sample_dist <- apply(out,1,median)
hist(sample_dist, breaks = 100)
The figure above is that median Samp Dist (hist). We simulated 5000 experiments, each with 1000 observations, so we take the median of these 5000 experiments. Draw as histogram.
#hist(sample_dist, breaks = 100)
quantile(sample_dist,c(0.05/2, 1 - 0.05/2))
## 2.5% 97.5%
## 5.644918 5.807284
#abline(v = quantile(sample_dist,c(0.05/2, 1 - 0.05/2)), col = "blue", lty = 2)
The Range of middle 95% of Samp Dist is from 5.6449183 to 5.8072845. This is the 95% confidence interval. We have 95% confidence that median will be within this range. This has an indicator effect on statistical inference.
#HT,Gamma, MLE
ll <- function(params, data) sum(dgamma(data,params[1],params[2],log=TRUE))
#ll(c(2,3),d1$gh)
g <- function(x) {
out <- x
for(i in seq_along(x)){
out[i] <- ll(c(x[i],0.14),d1$gh)
}
out
}
h <- function(x) {
out <- x
for(i in seq_along(x)){
out[i] <- ll(c(40,x[i]),d1$gh)
}
out
}
In this chunk, we build function f and function g. These functions are used to calculte the other parameter when fix one. Using the principle of MLE. They help us adjust the parameters.
require(stats4)
nLL_gamma <- function(shape, scale){
fs <- dgamma(
x = d1$gh
, shape = shape
, scale = scale
, log = TRUE
)
-sum(fs)
}
param_hat_gamma <- mle(
nLL_gamma
, start = list(shape = 40, scale = 0.14)
, method = "L-BFGS-B"
, lower = c(0, 0.01)
)
par(mfrow = c(1,2)); plot(profile(param_hat_gamma), absVal = FALSE)
In this chunk, we use function to calculate the MLE when set two parameters that close to real MLE. From the figure, we can see when z=0, the two parameters are what we need.
coef(param_hat_gamma)
## shape scale
## 40.7065036 0.1406358
This is the specific value of the two parameters.
hist(d1$gh,breaks = 100, freq = FALSE)
curve(dgamma(x,shape=coef(param_hat_gamma)[1],scale=coef(param_hat_gamma)[2]),col="red",lwd=6,add=T)
The figure above is that overlay estimated pdf onto histogram of the empirical distribution. It can be seen that their distribution is very concentrated in the 4-8 interval.
plot(ecdf(d1$gh))
curve(pgamma(x,shape=coef(param_hat_gamma)[1],scale=coef(param_hat_gamma)[2]),col="red",lwd=6,add=T)
The figure above is that overlay estimated CDF onto eCDF of the empirical distribution. The figure does not fit well, at least not as good as the height. It may be because the distribution is too concentrated. Small disturbances have big consequences.
qs <- seq(0.05,0.95,length=50)
sample_qs <- quantile(d1$gh,qs)
theor_qs <- qgamma(qs,shape=coef(param_hat_gamma)[1],scale=coef(param_hat_gamma)[2])
plot(sample_qs,theor_qs,pch=16)
abline(0,1)
Sample max doesn’t fit a straight line with y=x perfectly. We cannot say that the sample quantile fits the theoretical quantile very well.
qgamma(0.5,shape=coef(param_hat_gamma)[1],scale=coef(param_hat_gamma)[2])
## [1] 5.677983
The estimated Median is 5.677983. We use qgamma to select 50% of the observations, 50% of the observations are less than it, and 50% of the observations are greater than it. So this is the median of the sample.
M <- 5000
N <- 1000
out <- rgamma(M*N,shape=coef(param_hat_gamma)[1],scale=coef(param_hat_gamma)[2]) %>% array( dim=c(M,N))
sample_dist <- apply(out,1,median)
hist(sample_dist, breaks = 100)
The figure above is that median Samp Dist (hist). We simulated 5000 experiments, each with 1000 observations, so we take the median of these 5000 experiments. Draw as histogram.
quantile(sample_dist,c(0.05/2, 1 - 0.05/2))
## 2.5% 97.5%
## 5.607436 5.747332
The Range of middle 95% of Samp Dist is from 5.6074365 to 5.7473322. This is the 95% confidence interval. We have 95% confidence that median will be within this range. This has an indicator effect on statistical inference.
#HT,Gamma, MLE
ll <- function(params, data) sum(dweibull(data,params[1],params[2],log=TRUE))
#ll(c(2,3),d1$gh)
g <- function(x) {
out <- x
for(i in seq_along(x)){
out[i] <- ll(c(x[i],6.2),d1$gh)
}
out
}
h <- function(x) {
out <- x
for(i in seq_along(x)){
out[i] <- ll(c(4.1,x[i]),d1$gh)
}
out
}
In this chunk, we build function f and function g. These functions are used to calculte the other parameter when fix one. Using the principle of MLE. They help us adjust the parameters.
require(stats4)
nLL_weibull <- function(shape, scale){
fs <- dweibull(
x = d1$gh
, shape = shape
, scale = scale
, log = TRUE
)
-sum(fs)
}
param_hat_weibull <- mle(
nLL_weibull
, start = list(shape = 4.1, scale = 6.2)
, method = "L-BFGS-B"
, lower = c(0, 0.01)
)
par(mfrow = c(1,2)); plot(profile(param_hat_weibull), absVal = FALSE)
In this chunk, we use function to calculate the MLE when set two parameters that close to real MLE. From the figure, we can see when z=0, the two parameters are what we need.
coef(param_hat_weibull)
## shape scale
## 4.125254 6.173884
This is the specific value of the two parameters.
hist(d1$gh,breaks = 100, freq = FALSE)
curve(dweibull(x,shape=coef(param_hat_weibull)[1],scale=coef(param_hat_weibull)[2]),col="red",lwd=6,add=T)
The figure above is that overlay estimated pdf onto histogram of the empirical distribution. It can be seen that their distribution is very concentrated in the 4-8 interval.
plot(ecdf(d1$gh))
curve(pweibull(x,shape=coef(param_hat_weibull)[1],scale=coef(param_hat_weibull)[2]),col="red",lwd=6,add=T)
The figure above is that overlay estimated CDF onto eCDF of the empirical distribution. The figure does not fit well, at least not as good as the height. It may be because the distribution is too concentrated. Small disturbances have big consequences.
qs <- seq(0.05,0.95,length=50)
sample_qs <- quantile(d1$gh,qs)
theor_qs <- qweibull(qs,shape=coef(param_hat_weibull)[1],scale=coef(param_hat_weibull)[2])
plot(sample_qs,theor_qs,pch=16)
abline(0,1)
Sample max doesn’t fit a straight line with y=x perfectly. We cannot say that the sample quantile fits the theoretical quantile very well.
qweibull(0.5,shape=coef(param_hat_weibull)[1],scale=coef(param_hat_weibull)[2])
## [1] 5.64902
The estimated Median is 5.6490199. We use qweibull to select 50% of the observations, 50% of the observations are less than it, and 50% of the observations are greater than it. So this is the median of the sample.
M <- 5000
N <- 1000
out <- rweibull(M*N,shape=coef(param_hat_weibull)[1],scale=coef(param_hat_weibull)[2]) %>% array( dim=c(M,N))
sample_dist <- apply(out,1,median)
hist(sample_dist, breaks = 100)
The figure above is that median Samp Dist (hist). We simulated 5000 experiments, each with 1000 observations, so we take the median of these 5000 experiments. Draw as histogram.
quantile(sample_dist,c(0.05/2, 1 - 0.05/2))
## 2.5% 97.5%
## 5.526716 5.772334
The Range of middle 95% of Samp Dist is from 5.526716 to 5.772334. This is the 95% confidence interval. We have 95% confidence that median will be within this range. This has an indicator effect on statistical inference.
In this part, we use MLE to model Glycohemoglobin of adult females and we use three different distributions. The content above answers questions in Background part.
According to the feature of normal distribution.E[X]=mean(sample) and V[X]= Var(sample). Thus, we can conclude two parameters below.
mm.mean=mean(d1$gh)
mm.sd=sd(d1$gh)
This is the specific value of the two parameters.
hist(d1$gh,breaks = 100,freq = F)
curve(dnorm(x,mm.mean,mm.sd),col="red",lwd=6,add=T)
The figure above is that overlay estimated pdf onto histogram of the empirical distribution. It can be seen that their distribution is very concentrated in the 4-8 interval.
plot(ecdf(d1$gh))
curve(pnorm(x,mm.mean,mm.sd),col="red",lwd=6,add=T)
The figure above is that overlay estimated CDF onto eCDF of the empirical distribution. The figure does not fit well, at least not as good as the height. It may be because the distribution is too concentrated. Small disturbances have big consequences.
qs <- seq(0.05,0.95,length=50)
sample_qs <- quantile(d1$gh,qs)
theor_qs <- qnorm(qs,mm.mean,mm.sd)
plot(sample_qs,theor_qs,pch=16)
abline(0,1)
Sample max doesn’t fit a straight line with y=x perfectly. We cannot say that the sample quantile fits the theoretical quantile very well.
qnorm(0.5,mean=mm.mean,sd=mm.sd)
## [1] 5.7246
The estimated Median is 5.7246. We use qnorm to select 50% of the observations, 50% of the observations are less than it, and 50% of the observations are greater than it. So this is the median of the sample.
M <- 5000
N <- 1000
out <- rnorm(M*N,mean=mm.mean,sd=mm.sd) %>% array( dim=c(M,N))
sample_dist <- apply(out,1,median)
hist(sample_dist, breaks = 100)
The figure above is that median Samp Dist (hist). We simulated 5000 experiments, each with 1000 observations, so we take the median of these 5000 experiments. Draw as histogram.
quantile(sample_dist,c(0.05/2, 1 - 0.05/2))
## 2.5% 97.5%
## 5.645479 5.808015
The Range of middle 95% of Samp Dist is from 5.6454789 to 5.8080146. This is the 95% confidence interval. We have 95% confidence that median will be within this range. This has an indicator effect on statistical inference.
According to the feature of Gamma distribution.E[X]=shape∗scale and V[X]=shape∗scale^2. Thus, we can conclude two parameters below. scale=mean(sample)/var(sample) and shape= mean(sample)^2 / mean(sample).
mm_shape_gamma=mean(d1$gh)^2/var(d1$gh)
mm_scale_gamma=var(d1$gh)/mean(d1$gh)
This is the specific value of the two parameters.
hist(d1$gh,breaks = 100,freq = F)
curve(dgamma(x,shape=mm_shape_gamma,scale=mm_scale_gamma),col="red",lwd=6,add=T)
The figure above is that overlay estimated pdf onto histogram of the empirical distribution. It can be seen that their distribution is very concentrated in the 4-8 interval.
plot(ecdf(d1$gh))
curve(pgamma(x,shape=mm_shape_gamma,scale=mm_scale_gamma),col="red",lwd=6,add=T)
The figure above is that overlay estimated CDF onto eCDF of the empirical distribution. The figure does not fit well, at least not as good as the height. It may be because the distribution is too concentrated. Small disturbances have big consequences.
qs <- seq(0.05,0.95,length=50)
sample_qs <- quantile(d1$gh,qs)
theor_qs <- qgamma(qs,shape=mm_shape_gamma,scale=mm_scale_gamma)
plot(sample_qs,theor_qs,pch=16)
abline(0,1)
Sample max doesn’t fit a straight line with y=x perfectly. We cannot say that the sample quantile fits the theoretical quantile very well.
qgamma(0.5,shape=mm_shape_gamma,scale=mm_scale_gamma)
## [1] 5.660259
The estimated Median is 5.6602591. We use qgamma to select 50% of the observations, 50% of the observations are less than it, and 50% of the observations are greater than it. So this is the median of the sample.
M <- 5000
N <- 1000
out <- rgamma(M*N,shape=mm_shape_gamma,scale=mm_scale_gamma) %>% array( dim=c(M,N))
sample_dist <- apply(out,1,median)
hist(sample_dist, breaks = 100)
The figure above is that median Samp Dist (hist). We simulated 5000 experiments, each with 1000 observations, so we take the median of these 5000 experiments. Draw as histogram.
quantile(sample_dist,c(0.05/2, 1 - 0.05/2))
## 2.5% 97.5%
## 5.579084 5.744042
The Range of middle 95% of Samp Dist is from 5.5790839 to 5.7440416. This is the 95% confidence interval. We have 95% confidence that median will be within this range. This has an indicator effect on statistical inference.
According to the feature of Weibull distribution.E[X]= E(X) = b Γ(1 + 1/a) and V[X]=b^2 * (Γ(1 + 2/a) - (Γ(1 + 1/a))^2). These are not very good representations, we constructed the following equation.
#Define mean of Weibull Distribution
mean.weib = function(lambda,k){
lambda*gamma(1+1/k)
}
#Define variance of Weibull Distribution
var.weib = function(lambda,k){
lambda^2*(gamma(1+2/k)-(gamma(1+1/k))^2)
}
#Define lambda
lambda = function(samp_mean,k){
samp_mean/gamma(1+1/k)
}
#Define mean of Weibull Distribution using sample mean
var.weib = function(samp_mean,k){
lambda(samp_mean,k)^2*(gamma(1+2/k)-(gamma(1+1/k))^2)
}
#Define variance of Weibull Distribution using sample mean
var.weib = function(samp_mean,k,samp_var){
lambda(samp_mean,k)^2*(gamma(1+2/k)-(gamma(1+1/k))^2)-samp_var
}
#Build function to calculate two parameters in Weibull Distribution
mm.opt = optimize(f = function(x){
abs(var.weib(k=x,samp_mean = mean(d1$gh),samp_var = var(d1$gh)))
},lower=10,upper = 100)
mm.weib.k = mm.opt$min
mm.weib.lambda = lambda(samp_mean = mean(d1$gh),k=mm.weib.k)
mm.weib.k
## [1] 10.00008
mm.weib.lambda
## [1] 6.017337
This is the specific value of the two parameters.
hist(d1$gh,breaks = 100,freq = F)
curve(dweibull(x,shape=mm.weib.k,scale=mm.weib.lambda),col="red",lwd=6,add=T)
The figure above is that overlay estimated pdf onto histogram of the empirical distribution. It can be seen that their distribution is very concentrated in the 4-7 interval
plot(ecdf(d1$gh))
curve(pweibull(x,shape=mm.weib.k,scale=mm.weib.lambda),col="red",lwd=6,add=T)
The figure above is that overlay estimated CDF onto eCDF of the empirical distribution. The figure does not fit well, at least not as good as the height. It may be because the distribution is too concentrated. Small disturbances have big consequences.
qs <- seq(0.05,0.95,length=50)
sample_qs <- quantile(d1$gh,qs)
theor_qs <- qweibull(qs,shape=mm.weib.k,scale=mm.weib.lambda)
plot(sample_qs,theor_qs,pch=16)
abline(0,1)
Sample max doesn’t fit a straight line with y=x perfectly. We cannot say that the sample quantile fits the theoretical quantile very well.
qweibull(0.5,shape=mm.weib.k,scale=mm.weib.lambda)
## [1] 5.800788
The estimated Median is 5.8007881. We use qweibull to select 50% of the observations, 50% of the observations are less than it, and 50% of the observations are greater than it. So this is the median of the sample.
M <- 5000
N <- 1000
out <- rweibull(M*N,shape=mm.weib.k,scale=mm.weib.lambda) %>% array( dim=c(M,N))
sample_dist <- apply(out,1,median)
hist(sample_dist, breaks = 100)
The figure above is that median Samp Dist (hist). We simulated 5000 experiments, each with 1000 observations, so we take the median of these 5000 experiments. Draw as histogram.
#hist(sample_dist, breaks = 100)
quantile(sample_dist,c(0.05/2, 1 - 0.05/2))
## 2.5% 97.5%
## 5.748073 5.850781
#abline(v = quantile(sample_dist,c(0.05/2, 1 - 0.05/2)), col = "blue", lty = 2)
The Range of middle 95% of Samp Dist is from 5.748073 to 5.8507811. This is the 95% confidence interval. We have 95% confidence that median will be within this range. This has an indicator effect on statistical inference.
In this part, we use MM to model Glycohemoglobin of adult females and we use three different distributions. The content above answers questions in Background part.